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Abstract 

-^ In this paper we address the problem of estimating the ratio - where p is a density function and q is 

,_^ another density, or, more generally an arbitrary function. Knowing or approximating this ratio is needed 

f~^ in various problems of inference and integration, in particular, when one needs to average a function with 

^S) respect to one probability distribution, given a sample from another. It is often referred as importance 

t I sampling in statistical inference and is also closely related to the problem of covanate shift in transfer 

^ I learning as well as to various MCMC methods. It may also be useful for separating the underlying 

.^^ geometry of a space, say a manifold, from the density function defined on it. 

Our approach is based on reformulating the problem of estimating - as an inverse problem in terms 
of an integral operator corresponding to a kernel, and thus reducing it to an integral equation, known as 
the Fredholm problem of the first kind. This formulation, combined with the techniques of regularization 
^.^ and kernel methods, leads to a principled kernel-based framework for constructing algorithms and for 

r ^ analyzing them theoretically. 

The resulting family of algorithms (FIRE, for Fredholm Inverse Regularized Estimator) is flexible, 
simple and easy to implement. 



in 



^ We provide detailed theoretical analysis including concentration bounds and convergence rates for the 

I ^^1 Gaussian kernel in the case of densities defined on R"*, compact domains in R'* and smooth d-dimensional 

sub-manifolds of the Euclidean space. 
CN| We also show experimental results including applications to classification and semi-supervised learning 

^ within the covariate shift framework and demonstrate some encouraging experimental comparisons. We 

1*1 also show how the parameters of our algorithms can be chosen in a completely unsupervised manner. 

in 

^ 1 Introduction 

^D Density estimation is one of the best-studied and most useful problems in statistical inference. The question 

^' is to estimate the probability density function p{x) from a sample xi, . . . ,Xn- There is a rich literature on 

. . the subject (e.g., see the review jllj). particularly, dealing with a class of non-parametric kernel estimators 

^ going back to the work of Parzen [W . 

K^ In this paper we address the related problem of estimating the ratio of two functions, 4^ where p is 

JH given by a sample and q{x) is either a known function or another probability density function given by a 

sample. We note that estimating such ratio is necessary when one attempts to integrate a function with 
respect to one density, given its values on a sample obtained from another distribution. This is typical when 
the process generating the data is different from the averaging problem we wish to address. To give a very 
simple practical example of such a situation, consider a cleaning robot equipped with a dirt sensor. We would 
like to know how well the robot performs cleaning, however, the probability density of the robot location 
p{x) depends on the program and is clearly not uniform. To obtain the cleaning quality, we need to average 
the sensor readings with respect to the uniform density over the floor rather than the location distribution, 
which requires estimating the inverse - (here q{x) it the constant function 1). 

An important class of applications for density ratios relates to various Markov Chain Monte Carlo 
(MCMC) integration techniques used in various applications, in particular, in many tasks of Bayesian infer- 
ence. It is often hard to sample directly from the desirable probability distribution but it may be possible to 
construct an approximation which is easier to sample from. The class of techniques related to the importance 
sampling (see, e.g., [16]) deals with this problem by using a ratio of two densities (which is typically assumed 
to be known in that literature). 



Recently there have been a significant amount of work on estimating the density ratio (also known as te 
importance function) from sampled data, e.g., [3 [131 UHl [211 E] • Many of these papers consider this problem 
in the context of covariate shift assumption [24j or the so-called selection bias |36j . Our Fredholm Inverse 
Regularized Estimator (FIRE) framework introduces a very general and flexible approach to this problem 
which leads to more efficient algorithms design, provides very competitive experimental results and makes 
possible theoretical analysis in terms of the sample complexity and convergence rates. 

We will provide a more detailed discussion of these and other related papers and connections to our work 
in Section [2] where we also discuss how the Kernel Mean Matching algorithm [HI [10] can be viewed within 
our framework. 

The approach taken in our paper is based on reformulating the density ratio estimation as an integral 
equation, known as the Fredholm equation of the first kind (in the classical one-dimensional case), and solving 
it using the tools of regularization and Reproducing Kernel Hilbert Spaces. That allows us to develop simple 
and flexible algorithms for density ratio estimation within the popular kernel learning framework. In addition 
the integral operator approach separates estimation and regularization problems, thus allowing us to address 
certain settings where the existing methods are not applicable. The connection to the classical operator 
theory setting makes it easier to apply the standard tools of spectral analysis to obtain theoretical results. 

We will now briefly outline the main idea of this paper. We start with the following simple equality 
underlying the importance sampling method: 

E,ih) = J h{x)q{x)dx = J h{x)^pix)dx = Ep (h^)^) (1) 

By replacing the function h(x) with a kernel k{x,y), we obtain 

ICp-{x) := / k{x,y)——p{y)dy^ / k{x,y)q{y)dy -.^ lCql{x). (2) 

Thinking of the function ^4^ as an unknown quantity and assuming that the right hand side is known this 
becomes an integral equation (known as the Fredholm equation of the first type). Note that the right-hand 
side can be estimated given a sample from q while the operator on the left can be estimated using a sample 
from p. 



To push this idea further, suppose kt{x, y) is a "local" kernel, (e.g., the Gaussian, kt{x, y) = ,2^\d/2 e ^* 
such that Ld kt{x, y)dx — 1. Convolution with such a kernel is close to the (5- function, i.e., L^ kt{x, y)f{x)dx = 

f{y) + o{t). 

Thus we get another (approximate) integral equality: 

K^t,p-{y) -^ kt{x,y)-f-^p{x)dxpiq{y). (3) 

P Jri P[x) 

It becomes an integral equation for ^y^, assuming that q is known or can be approximated. 

We address these inverse problems by formulating them within the classical frameworUM of Tiknonov- 
Philips regularization with the penalty term corresponding to the norm of the function in the Reproducing 
Kernel Hilbert Space H with kernel k-u used in many machine learning algorithms. 

[Typel]: ^«argmin|l/Cp/-/C,l|li^^+A|l/|l2^ [Type II]: ^ « argmin H/C^.p/ - g|li^^^ + AH/H?, 

jy J t n. jj J t n. 

Importantly, given a sample xi, . . . ,Xn from p, the integral operator ICpf applied to a function / can be 
approximated by the corresponding discrete sum JCpf{x) w ^ J^i f{xi)K{xi, x), while L2,p norm is approxi- 
mated by an average: H/Hl^ ~ „ ^i /(^»)^- Of course, the same holds for a sample from q. 

Thus, we see that the Type I formulation is useful when g is a density and samples from both p and q 
are available, while the Type II is useful, when the values of q (which does not have to be a density function 
at alrl) are known at the data points sampled from p. 

Since all of these involve only function evaluations at the sample points, by an application of the usual 
representer theorem for Reproducing Kernel Hilbert Spaces, both Type I and II formulations lead to simple, 
explicit and easily implementable algorithms, representing the solution of the optimization problem as linear 
combinations of the kernels over the points of the sample X)i aikH{xi,x) (see Sectionlsl). We call the resulting 
algorithms FIRE for Fredholm Inverse Regularized Estimator. 



^In fact our formulation is quite close to the original formulation of Tikhonov. 

^This could be important in various sampling procedures, for example, when the normalizing coefficients are hard to estimate. 



Some remarks would be useful at this point. 

Remark 1: Other norms and loss functions. Norms and loss functions other that L2^p can also be used 

in our setting as long as they can be approximated from a sample using function evaluations. 

• Perhaps, the most interesting is the norm L2^q norm available in the Type I setting, when a sample 
from the probability distribution q is available. In fact, given a sample from both p and q we can 
use the combined empirical norm 7i2,p + (1 ~ l)L2,q- Optimization using those norms leads to some 
interesting looking kernel algorithms described in Section [3] We note that the solution is still a linear 
combination of kernel functions on centered on the sample from p and can still be written explicitly. 

• In the Type I formulation, if the kernels k(x,y) and kf{(x,y) coincide, it is possible to use the RKHS 
norm || • \\fi instead of ^2,^- This formulation (see Section Isl) also yields an explicit formula and 
is related to the Kernel Mean Matching algorithm [TU] (see the discussion in Section [2]) , although 
with a different optimization procedure. We note that the solution in our framework has a natural 
out-of-sample extension, which becomes important for proper parameter selection. 

• Other norms/loss functions, e.g., ii,p, Li,q, e-insensitive loss from the SVM regression, etc., can also be 
used in our framework as long as they can be approximated from a sample using function evaluations. 
We note that some of these may have advantages in terms of the sparsity of the resulting solution. 
On the other hand, a standard advantage of using the square norm is the ease of cross-validation with 
respect to the parameter A. 

Remark 2: Other regularization methods. Several regularization methods other than Tikhonov- 
Philips regularization are available. We will briefly discuss the spectral cut-off regularization and its potential 
advantages in Sectionis] We note that other methods, such as early stopping (e.g., [34l[T]) can be used and 
may have computational advantages. 

Remark 3. We note that an intermediate "Type 1.5" formulation is also available. Specifically, for two 
"(5-kernels" K and K', we have JCp- ~ /C' 1, thus using two different kernels in the Type I formulation 

^«argmin||/C,/- /Chilli, ^+A||/||?, (4) 

The ability to use kernels with different bandwidth for p and q may be potentially important in practice, 
especially when the samples from p and q have very different cardinality. The resulting algorithms for this 
setting are described in in Section [3J Of course, the previous two remarks apply to this setting as well. 

Since we are dealing with a classical inverse problem for integral operators, our formulation allows for 
theoretical analysis using the methods of spectral theory. In Section [4] we prove concentration and error 
bounds as well as convergence rates for our algorithms when data are sampled from a distribution defined 
in M*^, a domain in M.'^ with boundary or a compact d-dimensional sub- manifold of a Euclidean space M.^ for 
the case of the Gaussian kernel. 

It is interesting to note that unlike the usual density estimation problem the width of the kernel does not 
need to go to zero for convergence. However, it is necessary if we want a polynomial convergence rate. This 
is related to the exponential decay of eigenvalues for the Gaussian kernel. 

Finally, in Section [6] we discuss the experimental results on several data sets comparing our method FIRE 
with the available alternatives, Kernel Mean Matching (KMM) [TU] and LSIF [T3] as well as the base-line 
thresholded inverse kernel density estimatoiP] (TIKDE) and importance sampling (when available). 

We summarize the contributions of the paper as follows: 

1. We provide a formulation of estimating the density ratio (importance function) as a classical inverse 
problem, known as the Fredholm equation, establishing a connections to the methods of classical 
analysis. The underlying idea is to "linearize" the properties of the density by studying an associated 
integral operator. 

2. To solve the resulting inverse problems we apply regularization with an RKHS norm penalty. This 
provides a flexible and principled framework, with a variety of different norms and regularization 
techniques available. It separates the underlying inverse problem from the necessary regularization 
and leads to a family of very simple and direct algorithms within the kernel learning framework in 
machine learning. We call the resulting algorithms FIRE for Fredholm Inverse Regularized Estimator. 



^Obtained by dividing the standard kernel density estimator for g by a thresholded kernel density estimator for p Interestingly, 
despite its simplicity it performs quite well in many settings. 



3. Using the techniques of spectral analysis and concentration, we provide a detailed theoretical analysis 
for the case of the Gaussian kernel, for Euclidean case as well as distributions supported on a sub- 
manifold. We prove error bounds and as well as the convergence rates (as far as we know, it is the 
first convergence rate analysis for density ratio estimation). We also comment on other kernels and 
potential extensions of our analysis. 

4. We evaluate and compare our methods on several real-world and artificial different data sets and in 
several settings and demonstrate strong performance and better computational efficiency compared 
to the alternatives. We also propose a completely unsupervised technique for cross-validating the 
parameters of our algorithm and demonstrate its usefulness, thus addressing in our setting one of the 
most thorny issues in unsupervised/semi-supervised learning. 

5. Finally, our framework allows us to address several different settings related to a number of problems 
in areas from covariate shift classification in transfer learning to importance sampling in MCMC to 
geometry estimation and numerical integration. Some of these connections are explored in this paper 
and some we hope to address in the future work. 

2 Related work 

The problem of density estimation has a long history in classical statistical literature and a rich variety of 
methods are available [llj . However, as far as we know the problem of estimating the inverse density or 
density ratio from a sample has not been studied extensively until quite recently. Some of the related older 
work includes density estimation for inverse problems [7] and the literature on deconvolution, e.g., 0]. 

In the last few years the problem of density ratio estimation has received significant attention due in 
part to the increased interest in transfer learning |19j and, in particular to the form of transfer learning 
known as covariate shift 24 . To give a brief summary, given the feature space X and the label space F, two 
probability distributions p and q on X y.Y satisfy the covariate assumption if for all x, y, p{y\x) = q{y\x). It 
is easy to see that training a classifier to minimize the error for q, given a sample from p requires estimating 
the ratio of the marginal distributions '^^ f^l . Some of the work on covariate shift, ratio density estimation 
and other closely related settings includes [36 l l3l[8llT3 l [28 l lT0 l [29 l lT2 t [T8] 

The algorithm most closely related to our approach is Kernel Mean Matching (KMM) [10]. KMM is 
based on the observation that Eq{^{x)) = Ep{^^{x)), where $ is the feature map corresponding to an 
RKHS H. It is rewritten as an optimization problem 

44=arg min \\E,mx)) - Ep{P{xMx))\\n (5) 

The quantity on the right can be estimated given a sample from p and a sample from q and the minimization 
becomes a quadratic optimization problem over the values of /3 at the points sampled from p. Writing 
down the feature map explicitly, i.e., recalling that $(a;) = K'n{x,-), we see that the equality Eq(^{x)) — 
Ep{-^{x)) is equivalent to the integral equation Eq. [21 considered as an identity in the Hilbert space "H. 
Thus the problem of KMM can be viewed within our setting Type I (see the Remark 2 in the introduction) , 
with a RKHS norm but a different optimization algorithm. 

However, while the KMM optimization problem in Eq. [5] uses the RKHS norm, the weight function (3 
itself is not in the RKHS. Thus, unlike most other algorithms in the RKHS framework (in particular, FIRE), 
the empirical optimization problem resulting from Eq. [5j does not have a natural out-of-sample extensioij^ 

Also, since there is no regularizing term, the problem is less stable (see Section p^ for some experimental 
comparisons) and the theoretical analysis is harder (however, see [S] and the recent paper ^35] for some nice 
theoretical analysis of KMM in certain settings). 

Another related recent algorithm is Least Squares Importance Sampling (LSIF) [13j . which attempts to 
estimate the density ratio by choosing a parametric linear family of functions and choosing a function from 
this family to minimize the L2,p distance to the density ratio. A similar setting with the Kullback-Leibler 
distance (KLIEP) was proposed in ,29]. This has an advantage of a natural out-of-sample extension property. 
We note that our method for unsupervised parameter selection in SectionlGlis related to their ideas. However, 
in our case the set of test functions does not need to form a good basis since no approximation is required. 

We note that our methods are closely related to a large body of work on kernel methods in machine 
learning and statistical estimation (e.g., [26l [22l [21] ) . Many of these algorithms can be interpreted as inverse 
problems, e.g., [S] I25j in the Tikhonov regularization or other regularization frameworks. In particular, 



*In particular, this becomes an issue for model selection, see Section pj 



wc note interesting methods for density estimation proposed in |31j and estimating the support of density 
through spectral regularization in [S] , as well as robust density estimation using RKHS formulations |14) and 
conditional density j^. 

We also note the connections of the methods in this paper to properties of density-dependent operators in 
classification and clustering [321 123] ■ There are also connections to geometry and density-dependent norms 
for semi-supervised learning, e.g., [5]. 

Finally, the setting in this paper is connected to the large literature on integral equations ^15j . In 
particular, we note [32j . which analyzes the classical Fredholm problem using regularization for noisy data. 

3 Settings and Algorithms 

3.1 Some preliminaries 

We start by introducing some objects and function spaces important for our development. As usual, the 
space of square-integrable functions with respect to a measure p, is defined as follows: 

This is a Hilbert space with the inner product defined in the usual way by {f,g)2.p — Jfi f{'x)9{x)dp. 
Given a function of two variables k(x,y) (a kernel), we define the operator JCp'. 

^p/(y) := / k{x,y)f{x)dp{x). 
Jn 

We will use the notation ICt^p to explicitly refer to the parameter of the kernel function kt{x,y), when it is 
a (5- family. 

If the function k{x, y) is symmetric and positive definite, then there is a corresponding Reproducing Kernel 
Hilbert space (RKHS) H. We recall the key property of the kernel fc-^: for any f G H, (/, k-u{x^ ■))-u = f{x). 
The direct consequence of this is the Representer Theorem, which allows us to write solutions to various 
optimization problems over "H in terms of linear combinations of kernels supported on sample points (see |26j 
for an in-depth discussion or the RKHS theory and the issues related to learning). 

It is important to note that in some of our algorithms the RKHS kernel fc^ will be different from the 
kernel of the integral operator k. 

Given a sample Xi,...,Xn from p, one can approximate the L2,p norm of a functiorrl / by ||/||2p ~ 
nSil/(^j)P- Similarly, the integral operator Kpf{x) ~ -^J2i^{^ij^)f{^i)- These approximate equalities 
can be made precise by using appropriate concentration inequalities. 

3.2 The FIRE Algorithms 

As discussed in the introduction, the starting point for our development is the integral equality 

[Type I]: Kp^x) ^ f k{x,y)^p(y)dy ^ K,\{x). (6) 

P Jn P{v) 

Notice that in Type I, the kernel is not necessary to be in (5-family. For example, it could be linear kernel. 
Thus, we omit the t in the kernel for the Type I case. 

Moreover, if the kernel kt{x,y) is a Gaussian, which we will analyze in detail, or another ^-family and 
for / sufficiently smooth JCt^qf{x) « f{x)p{x) + o(l) and hence 

[Type II]: /Ci,/(x) = / A:t(x, y)^p(y)d2/ = g(a;) + o(l). (7) 

p Jn pyy) 

In fact, for the Gaussian kernel, the o(l) term is of the order t. Since it is important that the kernel kt is in 
the (5-family with bandwidth i, so we keep t in the notation in this case. 

Assuming that either K,q\. or q are known (for simplicity we will refer to these settings as Type I and 



Type II, respectively) these Eqs. 6 7 become integral equations for ^, known as the Fredholm equations of 
the first kind. 

To address the problem of estimating - we need to obtain an approximation to the solution which 
(a) can be obtained computationally from sampled data, (b) is stable with respect to sampling and other 



^/ needs to be in a function class whore point evaluations arc defined. 



perturbation of the input functiorrl and, preferably, (c) can be analyzed using the standard machinery of 
functional analysis. 

To provide a framework for solving these inverse problems we apply the classical techniques of regular- 
ization combined with the RKHS norm popular in machine learning. In particular a simple formulation of 
Eq|6]in terms of Tikhonov regularization with the L2,p norm is as follows: 

[Type I]: /j; = argmin ||/Cp/ - /C,l|||p + A||/|||, (8) 

Here T-L is an appropriate Reproducing Kernel Hilbert Space. Similarly Eq. [Tlcan be written as 

[Type II]: /j;' = argmin ||/C,,p/ - g||2_^ + A||/||2^ (9) 

We will now discuss the empirical versions of these equations and the resulting algorithms in different 
settings and for different norms. 

3.3 Algorithms for the Type I setting. 

Given an iid sample from p, Zp = {xi, 2:2, . . . , a;„} and an iid sample from q, Zq = {x'^^ Xj, . . . , x',j} (we will 
denote the combined sample by z) we can approximate the integral operators ICp and ICq by 

K.J[x)^-Y.k{x,,x)f{x,) and i^,,/(a;) = - ^ fc(a;:, a:)/(x:). (10) 

Thus the empirical version of Eq. [8] becomes 

/i,,=argmin- ^ ((/C.^/)(x,) - (/C.,l)(a;,))2 + A||/||2, (11) 

Xi£Zp 

We observe that the first term of the optimization problem involves only evaluations of the function / at the 
points of the sample Zp. 

Thus, using the Representer Theorem and the standard matrix algebra manipulation we obtain the 
following solution: 

flA^)^ ^ ku{x,,x)v, andv = [KlpKu+nXl) Kp^pKp^ql^^. (12) 

where the kernel matrices are defined as follows: {Kpp)ij — -k{xi,Xj), {K-^)ij — k-H{xi,Xj) for Xi,Xj G Zp 
and Kp^q is defined as {Kp^q)ij — ^k{xi,x'j) for Xi G Zp and x' e Zq. 

To compute the whole regularization path for all A's, or computing the inverse for every A, we can use 
the following formula for v. 

v = Q{A + nXiy^Q-^Kp^pKp^qU^, 

where Kp K-^ = QAQ^^ is a diagonalizatioiFl of K^ Ky^ (i.e., A is diagonal). 

When K'fi and Kp p are obtained using the same kernel function fc, i.e. -K-^ = ^p,p^ the expression 
simplifies: 

^ = „ i^lp + ^iy^ Kp^pKp^ql,^. 

In that case (or, more, generally, if they commute) the diagonalization is obtained by computing the eigen- 
decomposition of Kp^p = QAQ^, where Q is an orthogonal matrix. Then the solution could be computed 
using the following formula: 

Similarly to many other algorithms based on the square loss function, this formulation allows us to 
efficiently compute the solution for many values of the parameter A simultaneously, which is very useful for 
cross-validation. 



^Especially in Eq. 7 where the identity has an error term depending on t. 

^Strictly speaking, an arbitrary matrix can only be reduced to the Jordan canonical form, but an arbitrarily small pertur- 
bation of any matrix can be diagonalized over the complex numbers. 



3.3.1 Algorithms for 7^2,^ + (1 ^ 7)^2, g norm. 

Depending on the setting, we may want to minimize the error of the estimate over the probabiUty distribution 
p, q or over some Unear combination of these. A significant potential benefit of using a linear combination is 
that both samples can be used at the same time in the loss function. First we state the continuous version 
of the problem: 

/;=argmin l\\ICpf - IC,l\\l^ + {1 - j)\\ICpf - IC,l\\l^ + XWff^ (13) 

Given a sample from p, Zp — {xi,X2, ■ ■ ■ ,Xn} and a sample from q, Zq — {x'^^ x^t ■ ■ , x'^^} we obtain an 
empirical version of the Eq. |13[ 



fev. n -^ — ' V *■ m ^ — ' 



XiEZp oc'-ezg 

Using the Representer Theorem we can derive: 



where 



K = {liKp^P? + ^K.P^'^p) ^« ^nd K, = (j^Kp,pKp.q + i—lKlpKq^,^ 



Here iKp^p),^ = ^k{xi,Xj), {K-h)ij = k-H{xi,Xj) for Xi,Xj e Zp. Kp^q and Kq^p are defined as iKp^q)tj = 
^k{x„x'j) and {Kq^p)ji = ^k{x'j,Xi) for Xi G Zp,x'j G Zq. 

We see that despite the loss function combining both samples, the solution is still a summation of kernels 
over the points in the sample from p. 

3.3.2 Algorithms for the RKHS norm. 

In addition to using the RKHS norm for regularization norm, we can also use it as a loss function: 

/; = argmin||/Cp/ - JCql\\l, + A||/||2, (14) 

Here the Hilbert space H' must correspond to the kernel K and can potentially be different from the space 
T-l used for regularization. Note that this formulation is only applicable in the Type I setting since it requires 
the function q to belong to the RKHS H' . 

Given two samples Zp,Zq, it is straightforward to write down the empirical version of this problem, 
leading to the following formula: 

f\.zix)= Y Vik-H{xi,x) V = {Kp,pK-H+nXiy^ Kp^ql^^. (15) 

Xi£Zp 

The result is somewhat similar to our Type I formulation with the L2,p norm. We note the connection 
between this formulation of using the RKHS norm as a loss function and the KMM algorithm [10]. The 



Eq. 15 can be viewed as a regularized version of KMM (with a different optimization procedure), when the 
kernels K and Kf^ are the same. 

Interestingly a somewhat similar formula arises in [13] as unconstrained LSIF, with a different functional 
basis (kernels centered at the points of the sample Zq) and in a setting not directly related to RKHS inference. 

3.4 Algorithms for the Type II and 1.5 settings. 

In the Type II setting we assume that we have a sample z = {xi, X2, ■ ■ ■ , Xn} drawn from p and that we 
know the function values q{xi) at the points of the sample. 

Replacing the norm and the integral operator with their empirical versions, we obtain the following 
optimization problem: 

/ii, = argmin ^ ^ iJCt,zf{x.) - q{x^)f + A||/||2, (16) 

Recall that ICt.z is the empirical version of ICt^p defined by 



n ^ — ^ 



n 

Xi^Z 



As before, using the Representer Theorem we obtain an analytical formula for the solution: 

fxAx) = J2 ^n{x^,x)v, where v = {K^Kn + nXl)'^ /vq. (17) 

where the kernel matrix K is defined by Kij — -kt{xi,Xj), {K-u)ij — k-u{xi,Xj) and q^ ~ q{xi). 

3.4.1 Type 1.5: The setting and the algorithm. 

This case (see Eq. [4]) is intermediate between Type I and Type II. The setting is the same as in Type I, 
in that we are given two samples Zp from p and Zq from q. But similarly to Type II, we use the fact that 
/Cp- ~ /C' 1 when ICp and /C' are different (5-function-like kernels (e.g., two Gaussians of different bandwidth). 
The algorithm is similar to that for Type I with the difference that the kernel matrix K' is computed using 
the kernel k'{x,y): {K'^^^jij = :^^k'{x„x'j). 

fx,lix)= Y^ kuixi,x)v^ andv^ {KlpK-H+nXl) Kp^pR'^^l^^. 

Xi£Zp 

3.5 Spectral Cutoff Regularization 

In this section we briefly discuss an alternative form of regularization, based on thresholding the spectrum 
of the kernel matrix. It also leads to simple algorithms comparable to those for Tikhonov regularization and 
may have certain computational advantages. 

Since /Cp is a compact self-adjoint operator on L2^p, its eigenfunctions {uq,ui^ . . .} form a complete 
orthogonal basis for 2^2,p- An alternative method of regularization is the so-called spectral cutojf where the 
problem is restricted to the subspace spanned by the top few eigenfunctions of /Cp Thus the regularization 
problems become 

/^'"P'='= = arg min ||/Cp/ - /C,l||^_p 

Jt Hk 

/■II.SPCC ■ 111--- r II 2 

/a == arg nun \\ICt,pf - qh.p 

where T-Lk and lit.k is the finite dimensional subspace of L2,p spanned by the eigenvectors of /Cp and /C^^p 
corresponding to the k largest eigenvalues. 

Without going into detail, it can be seen that the corresponding empirical optimization problems are 



flT^ = arg i^in - E (^-./(^O - ^t.zl{x.)f (18) 

XiEZp 

fx:r' = arg min - J^ (^M./(^0 ^ 9(^.))' (19) 

XiGz 

where the span of eigenvectors of the kernel matrix K is taken instead of the eigenfunctions of /Cp or ]Ct,p- 

For this algorithm, we assume K-u and Ki use the same kernel. Then the solution to the empirical 
regularization problems given in Eqs. 18|19 are respectively 



fx^r^i^)^- Yl Hxt,x)vi 

XiGZp (20) 

V - QkAfQlK2lz, 



fx,z'^'"'{x) = - ^ kt{xi,x)vi 



" ." (21) 

V = QkK^Qlq 

where Ki = QAQ^ is the eigendecomposition of Ki with orthogonal matrix Q and diagonal matrix A, and 
Qk and A^ is the submatrices of Q and A corresponding to the k largest eigenvalues of the kernel matrix Ki 
and the remaining objects are defined in the previous subsection. 

We note that spectral regularization can be faster computationally as it requires to compute only the 
top few eigenvectors of the kernel matrix. There are several efficient algorithms for computing eigen- 
decomposition when only the first k eigenvalues are needed. Thus spectral regularization can be more com- 
putationally efficient than the Tikhonov regularization which potentially requires a full eigen-decomposition 
or matrix multiplication. 



3.6 Comparison of type I and type II settings. 

While at first glance the type II, setting may appear to be more restrictive than type I, there are a number 
of important differences in their applicability. 

1. In Type II setting q does not have to be a density function (i.e., non-negative and integrate to one). 



2. Eq. 11 of the Type I setting cannot be easily solved in the absence of a sample Zq from q, since 
estimating JCq requires either sampling from q (if it is a density) or estimating the integral in some 
other way, which may be difficult in high dimension but perhaps of interest in certain low-dimensional 
application domains. 

3. There are a number of problems (e.g., many problems involving MCMC) where q{x) is known explicitly 
(possibly up to a multiplicative constant), while sampling from q is expensive or even impossible 
computationally |17j . 

4. Unlike Eq. [H) Eq. [9] has an error term depending on the kernel, which is essentially the difference 
between the kernel and the ^-function. For example, in the important case of the Gaussian kernel, the 
error is of the order 0(i), where t is the variance. 

5. While a number of different norms are available in the Type I setting, only the L2,p norm is available 
for Type II. 

4 Theoretical analysis: bounds and convergence rates for Gaus- 
sian Kernels 

In this section, we state our main results on bounds and convergence rates for our algorithm based on 
Tikhonov regularization with a Gaussian kernel. We consider both Type I and Type II settings for the 
Euclidean and manifold cases and make a remark on the Euclidean domains with boundary. 

To simplify the theoretical development the integral operator and the RKHS H will correspond to the 
same Gaussian kernel kt{x,y). Most of the proofs will be given in the next Section [5] We note that two 
Gaussian kernels with different bandwidth parameters can be analyzed using only minor modifications to 
our arguments. 

4.1 Assumptions 

Before proceeding to the main results, we will state the assumptions on the density functions p and q and 
the basic setting for our theorems: 

1. The set J7 where the density function p is defined could be one of the following: (1) the whole W^] 
(2) a compact smooth Riemannian sub- manifold M of d-dimension in M". In both cases, we need 
< p{x) < T for any x G fl. The function q should satisfy q € iv2,p and needs to be bounded from 
above. We will also make some remarks about a compact domain in M.'^ with boundary. 

2. We also require ^4^ S W2{^) and q G IV|(ri), where iy|(i7) is the Sobolev space of functions on fl 
(e.g., [30]). Certain properties of W2{^) will be discussed later in the proof. 

1/2 

It will be important for us that "H is isometric to L2^p under the map /Cp : L2,p — ^ ^, that is, ||/||-h = 
ll^p /I|l2 p for any / € H. Here the integral operator K.p uses the RKHS kernel corresponding to H. 

4.2 Main Theorems 

4.2.1 Type I setting 

We will provide theoretical results for our setting Type I, where both the operator and the regularization 
kernel are Gaussian kt{x, y) — tt^— rw2 e st with the same bandwidth parameter t. 

Theorem 1. Let p and q be two density functions on il and q be another density over fl satisfying the 



assumption in Sec. 4-1 Given n points, Zp = {xi, X2, . . . , Xn}, i.i.d. sampled from p and m points, Zq — 
{x\ , x'2, ■ ■ ■ ,x'j^n}' i-i-d. sampled from q, and for small enough t, for the solution to the optimization problem 



in (11), with confidence at least 1 — 2e '^ , we have 



(1) If the domain Q. isW^, 



f\,z ^ 



T / 1 



+ C3 



Ai'^/2 



+ C2A 2 (Approximating Error) 
(Sampling Error), 



m A^ 



(22) 



where Ci,C2,C3 are constants independent oft,X. 

(2) If the domain fl is a compact manifold without boundary of d dimension, 



f\.: 



<Cit + C2A ' (Approximating Error) 



2,P 



c. 



1 



1 



where Ci,C2,C3 are constants independent oft,\. 
Proof. See Section [5J 



(23) 



(Sampling Error), 



n 



Remark 1: convergence for fixed t. For the Euclidean case in Eq. [22j with fixed kernel width t, the error 
will converge to 0, as A —> given sufficiently many data points. However the required number of points is 
exponential in -^. This is related to the fact the eigen- values of the operator K.p decay exponentially fast, 
when the kernel is Gaussian. On the other hand choosing both t and A as a function of n leads to a much 
better polynomial rate given below. 



Remark 2. A minor modification of the proof provides the following simpler version of Eq. 22 



fx 



< Citi +C2A5 +C3 



2,P 



Ai'^/2 



1 



1 

AVS 



(24) 



As a consequence we obtain the following corollary establishing the convergence rates: 
Corollary 2. Assuming m > X^'^n, with confidence at least 1 — 26^"^, we have the following: 

(1) ifn -M^ 



Ix.z ^ 



P 



2,P 



=^ O (y/rn 3.5=+d) 



(2) If Q is a d-dimensional sub-m,anifold of a Euclidean space, 

2 



O 



fn 3.5 + (i/2 



2,P 

+^ A - 



Proof. For the Euclidean space, set t ~ n 3 '+'' , A = n 3' 
Euclidean case) . For the sub-manifold case set t = n^ '^^ , A == n^ "^ . 



and apply Theorem 111 (Eq, 







24 



for the 

n 



4.2.2 Type II setting 

In this section we provide an analysis for the Type II setting and also make a remark about the error analysis 
for the compact domains in W^. 

Recall that in Type II setting we have a set of points sampled from p and assume that the values of q on 
those points are known. Note, that q does not have to be a density function. 



Tiieorem 3. Let p he a density function on 57 and q he a function satisfying the assumptions in Sec. \41\ 
Given n points z = {xi, X2, ■ ■ ■ , Xn} sampled i.i.d. from p, and for sufficiently small t, for the solution to 
the optimization problem in (16), with confidence at least 1 — 26""^, we have 



(1) If the domain i7 is M , 



pii 



<Ci 



2,P 



log A 



+ C2A^+C3A-3|1/Ct,,l 



9ll2,p + ^4^3/2^,/2/^, 



(25) 



where Ci,C2, 6*3,(74 are constants independent oft,X. Moreover, \\ICt^ql ~ 9II2 = 0(t). 
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(2) If fl is a d-dimensional sub-manifold of a Euclidean space, 



l\,z 



<C^t + C2AI/2 + CaA-^ ||/C,,,1 - g|L „ + d 



^ 



2,P 



where Ci,C2,C3,C4 are constants independent oft,\. Moreover, \\Kt^q^ — 9II2 = 0(t^^^) for any e > 






(26) 



Remark. It can be shown that if 51 is a compact subset with sufficiently smooth boundary in M"^, wc have 
the same bound with (I) except for ||/Ct,ql — q\\2 — 0{ti^^) for any any e > 0. 
As before, we obtain the rates as a coroUary: 

Corollary 4. With confidence at least 1 — 2e"'^, we have: 

(1) ifn = R^, 



p 



2,P 



O VTn 



(2) If ^ is a d-dimensional sub-manifold of a Euclidean space, than for any < £ < 1 

2 



rii q 



= / 



2,P 



4-4e+jJci 



Proof. For the case of M , set t = n '^■f^+<^,X = n *+ e '^ . For case of sub-manifold case, set t = n 4.s-4.8e+d^^ 



Apply Theorem 



B 



n 



5 Proofs of Theorems 

In this section, we provide a proof for the our main Theorem IT] for setting I. The proof for the Theorem [3] 
for the setting type II is along similar lines and can be found in the appendix. 



5.1 Basics about RKHS 

Since K,t,p is a self-adjoint operator, its eigenfunctions {uo.t, ui.t, . . . } form a complete orthogonal basis for 
i2,p- Denote the eigenvalues oi Kt.p by {cro,t,cro,t, • • • }• The norm oi JCt.p, ||^i,p||L2.p^-L2.p < max^ CTi^f < c 

1 /2 

for a constant c. We know that Ht is isometric to i2.p under the map /Cj „ : L2.P — > Ht, i.e. ||/||iff =: 

— 1/2 

||/Cj p /I1l2 p for E^ny f ^ Ht, and this is the definition we use for the norm || • \\}jt oi Ht. This also implies 

1/2 

that ||/Cj /IIlq < 00 for any / G Ht. And /Cj p is defined using the spectrum of /Cf,p, 



K^t,pf 



E 



Ci,p{f,Ui^t)Ui,t 



5.2 Proof of Theorem [T] 

Proof. Recall the definition of f\ and f\ ^ in Eq. p^ and Eq. fill By the triangle inequality, we have 



The approximation error 



-A 



f\ 



< 



2,P 



-fl 



+ II fl - fl 



2,P 



: (Approximation Error) 



Il2,p- 

(Sampling Error) 



(27) 



is a measure of the distance between ^ and the optimal approximation 

2,p P 

the difference 



Jx J\ 



A,z 



2,P 



given by algorithm ( 8 ) given infinite number of data. The sampling error term 

between f\ and f\ ^, depending on the data points. 

As typical in these types of estimates our proof consists of two parts: bounding the approximating error. 



fl 



in Lemma 



2,P 



and providing a concentration bound for 



J\.z Jx 



follows immediately by putting these two results together. 



2,P 



in Lemma |8| The theorem 

a 
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5.2.1 Bound for Approximation Error 

First of all, let present two lemmas that are useful for bounding the approximation error. 
Lemma 5. Let A > 0. If function f e W^|(IR'*) and p{x) > for any x e W^, then 

2 



arg mm 



f - K\[:^g + \Mi^ ) < D^t^ + \D^ m\ . 



for constants Di , D2 ■ 
Proof. See Appendix [A} 



(28) 



n 



Lemma 6. Let A > 0. If function f e VK|(A^) defined on a compact Riemann sub-manifold of d- dimension 
in a Euclidean space, then 



arg mm 



/-O ^_^ + A||5llL] <^ill/ll2.pi' + Ai?2|l/|l2,p. 



(29) 



for constants Di , D2 ■ 

Proof. See Appendix [B] D 

Now we can present the lemma that gives the bound of the approximation error in the following lemma. 

Lemm a 7. Let p, q be two density functions of probability measure over a domain X satisfying the assump- 
tions in 



The solution to the optimization problem in ji^, f^, satisfies the following inequality, 



(1) when the domain X is M.''', 



P 



2,p - ""' i^) 



sjl 



+ C2AI/2 



for constants Ci,C2 which are independent of X and t. 

(2) when the domain X is a compact Riemannian sub-manifold M of d dimension in M^, 



^' P 



<Cit + C2X^/^ 



2,P 



for constants Ci, C2 which are independent of X and t. 

Proof. Recall the equation ([8|. By functional calculus, we have analytical formula for /^ as follows, 



The last equation is because 
Thus the approximating error is 



^' P 



2:P 



^t,q^ — ^t,p- 



\ ,F / '^ p p 



2,P 



Notice that (/C^,j, + XX) /Cf ^ ^ in ( 30 ) can also be rewritten as 

(/C3^ + AZ)-'/C3p^=arg 
Thus, 



mm 
' g&L2,p 



Q 1^3/2 

- - /^tp 9 

p 



2,P 



AII5IIL 






2,P 



< min 

a<^L2,p 



Q ^-3/2 

p "P 



2,P 



+ M\9\\lp 



(30) 



(31) 
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The minimum of the above optimization problem can always be bounded by any specific g e i2,p- And 
we will expend the above formula such that we can take advantages of Lemma [5] and [6] To this end, we 
define an operator 



5a = T{f, A) = arg min / - K-l^^g 



2,P 



M\9\ 



2,P 



By functional calculus, it is not hard to see that g*^ = {K,t,p + XI) ICt^pf- If / G W|, so is g*^, this is 
because ICt^p is an integral operator with Gaussian kernel and Gaussian kernel is in W2 for any s > 0. Also, 
we should have Wglh.p < \\fh,p, because \\{JCt,p + XI)^'^K^t,p\\L2.p-^L2.p < 1- 

Now let gl = r(2,A) ,5^ = r(5i*,A),g| - r(<?2,A). We have glg*^ is also in Wi and ||52l|2,p < 
||5il|2,p < ||§|l2,p. Now we could expend ([3l|, 



mm 

geK2,p 



3/2 



P 



^t[p 9 



2,P 



A||.g|lL 



< 



3/2 






2,P 



A|!53ll^,p 



- - i^t,p 91 + i^t,p 91 - fit,p92 + /Ct,p5i - /Ct p gz 



2,P 



+ A||53lllp 



By inequality (a + 6 + c)-^ < 3(a^ + 1)^ + c^), we have 

2 

+ A\9\\lp 

2,P 
2 



mm 

g&K.2,p 



Q ;k-3/2 

- - ''^t'p 5 



<3 



<3 



f I^t,py92-K^l!p9*3) 2 +A||.93ll2,pj 

A||ffillL)+3c'^'f 51-O2 



/2 * 
,p i/2 



9 (^1/2 * 



+ 3c 



:+: j^l/2 * ■.11 * II 2 

92-l^t,p93 +A||g3||2.p 

2,P 



2 
2,P 



A|1.92ll2,p 



AII52II2, 



The last inequality is because ||/Ci.p||L2->-L2 < c for constant c > 1. Up to now, the proof is valid for both 
cases in the theorem. And we can apply Lemma [5] and [6] to get the bounds for both cases. By Lemma [5] for 
the densities p, q over M'', we have 



-1^3 9 



{Kl^ + \I) JClj, 



P P 



2,P 



< min 

g&K.2.p 



3/2 



K.t.,p 9 



2,P 



X\\9\\Ip<QcD, 



,pVlog(i) 



9cD2A^ 



q 

P 
(32) 



2,P 



Recall (30), we have 



fl-'^ 



<\ QcDi 



t" + 9CD2X 



2,P 



2 / ^ V2 

2,p Vl0g(x; 



(33) 



where Ci = 3. cDi 
Applying Lemma [6 



,C2-3,/cD 



2,P 



we will have the result for manifold case, 



fi-'- 



<j9cDi 



2,P 



i2+9cD2A 



2,P 



<Cit + CaA^/^ 



2,P 



(34) 



where Ci — 3. cDi 



,C2^3. cD 



2,P 



n 



13 



5.2.2 Bound for Sampling Error 

In the next lemma, we will give concentration of the sampling error, ||/^ — f\ z\\2,p- 

Lemma 8. Let p be a density of a probability measure over a domain X an d q another density function. 
They satisfy the assumptions in 4-1 Consider fj^ and fl^ defined in (|(Si) and (11), with confidence at least 
1 — 2e~'^ , we have 

\\rl rl \\ <^ /Wr _W^\ 

||/a - ixA\2.p ^ ^3 [j^ + XTAy^J 

where Kt = sup^g^ kt{x,x) = ^^Jy/2 

Proof. Recall that, 

f{ = arg min ||/C*,p/ - ICt,,l\\lp + \\\f\\l^ 

and 

/I,, = arg min i ^ ((/C.J)^) - (/C.,1)K))' + A||/||^^ 
J ^^t,z ri 

Using functional calculus, we will get the explicit formula for f\ and f\ ^ as follows, 
and 

fi. = {ic%+xiy'iciic.^i. 

Then the bound for sampling error is to bound the above two objects. Let / — ( /C^ + AX J /Cp/Cgl. We 
have fi - fl, ^fl-f + f- fl^. For fl - /, using the fact that (/C^ + Xl)fl= /C^/C^l, we have 

=fi~{jCl+Xiy\lCl + Xl)fi 

= (lCl^+Xiy\lCX~K.f)fi 
And 

/ ^ I\,z 

K.i^+xiy\iic,i - (K.i^+xiy\iK.i 



Notice that we have /C^ _ JQ^ and /C^ /C^^ — JCiK.q in the identity we get. For these two objects, it is 
not hard to verify the following identities, 

= {iCzp - ICp) + K.p (iCzp - ICp) + {K.zp - ICp) JCp {JCzj, - JCp) + (/Czp - K.p) JCp 
+ K.p (A^zp ~ ICp) + ICp [iCzp - ICp) ICp + [iCzp - ICp) ICp. 
And 

ICzpICz, -ICplCq 

= {ICzp - ICp) {iCz^ - ICq) + ICp {ICzp - ICp) {ICz^ - ICg) + {ICzp - ICp) ICp {ICz^ - ICq) 

+ ICp {iCzg - ICq) + {iCzp - ICp) ICq + ICp (iCzj, - ICp) ICq + (iCzp - ICp) ICplCq. 

Thus, in these two identities, the only two random variables are /C^ — ICp and JCz — ICq. By results 
about concentration of JCz and ICz , we have with probability 1 — 2e^'^, 

\\ICzp ~ ICpWu^H < ^, 

ll/C.,-/C,||«^«<^, (35) 






■'9 WV. ,/jj^ 
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And we know that for a large enough constant c which is independent of t and A, 



IK^pWn^-H < c, \\JCq\\-H^-H < c, 



{ic^ + m} 



H^H 



<^,||/C,l|l«<c 



and 



ixWn 



T.(aflxr\p^''' 



< sup 



>o(a3 + A)2 



E(-'-' 



<c' 



Ai/3 



thus, WfiWn < jf, 



Ai/6 p 



Notice that 



2,P 



{IC^^ -ICpY 



< IC^ 



/Cj, L. and 



{jc^^ -jCpY 



< /c. 



/Cj, L., both of this could 



be of smaller order compared with /Cz^ — /Cp L,. For simplicity we hide the term including them in the 
final bound without changing the dominant order. We could also hide the terms with the product of any 
two the random variables in Eq. |40[ which is of prior order compared to the term with only one random 
variable. Now let us put everything together, 

\\fi-fij2..p<c'/mi-fijH, 



<c 



72 / C Kt^/T 



<c. 




X\/rn 



where C3 = c^' ^ max 



2,P 



n 



6 Experiments 

In this section we explore the empirical performance of our methods under various settings. We will primarily 
concentrate on our setting Type II and use the same Gaussian kernel for the integral operator and the 
regularization term to simplify model selection. 

This section is organized as follows. In Subsection |6. 1 1 we describe a completely unsupervised procedure 
for parameter selection, which will be used throughout the experimental section. In Subsection |6.2| we briefly 
describe the data sets and the re-sampling procedures we use. In Subsection |6.3| we provide a comparison 
between our methods using different norms and other methods based on the expected performance under our 



evaluation criteria. In Subsection 6.4 we provide a number of experiments comparing our method to different 
methods on several different data sets for classification and regression tasks. Finally in Subsection |6.5| we 
study performance of different kernels in both Type-I and Type-II setting using two simulated data sets. 



6.1 Experimental Setting and Model Selection 
The setting: In our experiments, we have a set of a data set X^ = {. 



rP n — 



1,2, 



,n 



} and another set of 



, ,to}. The goal is to estimate - under the assumption that X^ is sampled 



instances X'^ — {x'j,j — 1,2, 
from p and X' is sampled from q. 

We note that our algorithms typically has two parameters, which need to be selected, the kernel width 
t and the regularization parameter A. In general choosing parameters in a unsupervised or semi-supervised 
setting is a hard problem as it may be difficult to validate the resulting classifier/estimator. However, 
certain features of our setting allow us to construct an adequate unsupervised proxy for the performance of 
the algorithm. Now we construct a performance measure for the quality of the estimator. 
Performance Measure. We describe a set of performance measures to use for parameter selection. 

For a given function u, we have the following importance sampling equality (Eq. [l]): 

p{x) 



Eq{u{x)) =Ep [u{x) 



If f{x) is an approximation of the true ratio -, using the samples from X^ and X' respectively, we will have 
the following approximation to the above equation: 



1 



1=1 



"«)/«) 



E-H)- 
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Therefore, after obtaining an estimate / of the ratio, we can vahdate it by using a set of test functions 
U = {ui,U2, ■ ■ ■ ,uf} using the following performance measure: 

J(/;XP,X^C/) = ^J2\ Y,<xl)f{xl)-Y^ui{x]) 1 (36) 

( = 1 \i = l 3=1 

where U — {ui, U2, ■ ■ ■ , uf} is a collection of function chosen as criterion. Using this performance measure 
allows various cross-validation procedures to be sued for parameter selection. 

We note that this way of measuring the error is related to the LSIF [l3j and KLIEP |29j, algorithms. 
However, there a similar measure is used to construct an approximation to the ratio fracqp using functions 
ui, . . . ,uf as a basis. In our setting, to choose parameters, we can use validations sets (such as linear 
functions) which are poorly suited as a basis for approximating the density ratio. 

Choice of validation function sets for parameter selection. In principle, any set of (sufficiently well- 
behaved) functions can be used as a validation set. From a practical point of view we would like functions 
to be simple to compute and readily available for different data sets. 

In the our experiments, we will use the following two families of functions for parameter tuning: 

(1) Sets of random linear functions u{x) — 0^x where /? ^ N{Q, Id). 

(2) Sets of random half-space indicator functions, u{x) — I^t^^q where /3 ^ N{0,ld). 

Remark 1: We have also tried (a) coordinates functions, (b) random combination of kernel functions, and 

(c) random combination of kernel functions with thresholding. In our experience the coordinate functions are 

not rich enough for adequate parameter tuning. On the other hand, using the kernel functions significantly 

increases the complexity of the procedure (due to the necessity of choosing the kernel width and other 

parameters) without increasing the performance significantly. 

Remark 2: Note that for linear functions, the cardinality of the set should not exceed the dimension of the 

space due to linear dependence. 

Remark 3: It appears that linear functions work well for regression tasks, while half-spaces are well-suited 

for classification. 

Procedures for parameter selection. 

We optimize the performance using cross-validation by splitting the data set in two parts X*''*'""™ and 
j^q,train ^gg^ fQj. training and X^''^'" and X'^-™ used for validation, and repeating this process five times to 
find the optimal values of parameter^ 

For the two parameters which need to be tuned, the kernel width t and the regularization parameter A, 
we specify a parameter grid as follows. The range for kernel width t is {to, 2to, • . ■ , 2^io), where to is the 
average distance of the 10 nearest neighbors, and regularization parameter A is (le — 5, le — 6, . . . , le — 10). 

6.2 Data sets and Resampling 

In our experiments, several data sets are considered: BankSFM, CPUsmall and KinSnm for regression; and 
USPS and 20 news groups for classification. 

For each data set, we assume they are i.i.d. sampled from a distribution denoted by p. We draw the first 
500 or 1000 points from the original data set as X^. To obtain X'^, we apply a resampling scheme on the 
remaining points of the original data set. Two ways of resampling, using the features of the data and using 
the label information, are used (along the lines similar to those proposed in [^). 

Specifically, given a set of data points with labels {{xi, yi), {x2, 2/2)1 ■ • • j (^^m Vn)} we resample as follows: 

• Resampling using feature information (labels y^ are not used). We subsample the data points 
so that the probability Pi of selecting the instance i, is defined by the following (sigmoid) function: 



P, 



1 + g(a(a:i,ei)-6)/CT„ 



where a, b are the resampling parameters, ei is the first principal component, and (t„ is the standard 
deviation of the projection to ei. Note that in this resampling scheme, the probability of taking one 
point is only conditioned on the feature information Xi. This resampling method will be denoted by 
PCA(a,5). 

*We note that this procedure cannot be used with KMM as it has no out-of-sample extension. Therefore in subsection |6.3| 
we do not compare our method with KMM since there is no obvious way to extend the results to the validation data set. 
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Resampling using label information. 

Pi, is defined by 



The probability of selecting the i'th instance, denoted by 
I Otherwise. 



where yi G L = {1,2, ... ,k} and Lq is a subset of the complete label set L. We apply this for binary 
problems obtained by aggregating different classes in the multi-class setting. 



6.3 Testing the FIRE algorithm 

In first experiment, we test our method for selecting the parameters, which is described in Section |6.1 



by focusing on the the error J{f;X'P,X'^,U) in Eq. 36 for different function classes U. We use different 



families of functions for tuning parameters and validation. This measure is important because in practice 
the functions we are interested may not be in the collection we chosen for validation. To avoid confusion, 
we denote the function for cross validation by /'^^ and the function for measuring error by f^" . 

We use the CPUsmall and USPS hand-written digits data sets. For each of them, we generate two data 
sets X'P and X'^ using the resampling method, PCA(a, cr„), describe in Section 6.2 We compare FIRE 



with several methods including TIKDE, LSIF. Figure [T] gives an illustration of the procedure and usage 
of data for the experiments. And the results are shown in Table [T] and [2J The numbers in the table are 
the average errors defined in Eq. 36 on held-out set X'^" over 5 trials, using different criterion functions 
/■^^ (Columns) and error- measuring functions /'^'^'^(Row). N is the number of random function we are using 
for the cross-validation. 



Fold 
1 



Fold 
2 



Fold 
k 



Figure 1: First of all Xp,X« are split into XP''''' and XP'°"', XI'"'' and X'?^<=". Then we further split Xp^''^ 
into k folds. For each fold i, density ratios at the sample points are estimated using only folds j ^ i and 
X'^''^'^ , and compute the error using fold i and X'>^^^ . We choose the parameter gives the best average error 
over the k folds of XP''^^ . And we measure the final performance using XP'^'^'^ and X'^''"'''. 

For the error-measuring functions, we have several choices as follows: 

(1) Linear: Sets of Random linear functions f{x) = 0^x where ji ^ N{0,ld). 

(2) Half-space: Sets of random half-space indicator functions, f{x) — I^t^^q where f3 ~ N{0,ld). 

(3) Kernel: Sets of random linear combination of kernel functions centered at the training data, f{x) — 'y'^ K 
where 7 ~ N{0,ld) and Kij — k(xi,Xj) where Xi are points from the data set. 

(4) K-indicator: Sets of random kernel indicator functions centered at the training data, / = l-^TKyo where 
7 ~ N{0,ld) and Kij = k{xi,Xj) where Xi are points from the data set. 

(5) Coord: Sets of coordinate functions. 
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Table 1: USPS data set with resampling using PCA(5,(Jt,), where a^ is the standard deviation of projected 
value on the first principal component. And \X'p\ — 500 and \X'i\ — 1371. Around 400 in X'^ and 700 in X'^ 
are used in 5-folds CV. 





Linear 


Half- 


spaces 




N=50 


N=200 


N=50 


N=200 


Linear 


TIKDE 


10.9 


10.9 


10.9 


10.9 


LSIF 


14.1 


14.1 


26.8 


28.2 


FIRE(L2.p) 


3.56 


3.75 


5.52 


6.32 


FIRE(L2,p + L2,,) 


4.66 


4.69 


7.35 


6.82 


FIRE(L2.,) 


5.89 


6.24 


9.28 


9.28 


Half-spaces 


TIKDE 


0.0259 


0.0259 


0.0259 


0.0259 


LSIF 


0.0388 


0.0388 


0.037 


0.039 


FIRE(i2,p) 


0.00966 


0.0091 


0.0103 


0.0118 


FIRE(L2,p + L2,,) 


0.0094 


0.0102 


0.0143 


0.0107 


FIRE(i2,9) 


0.0124 


0.0135 


0.0159 


0.0159 


Kernel 


TIKDE 


4.74 


4.74 


4.74 


4.74 


LSIF 


16.1 


16.1 


15.6 


13.8 


FIRE(L2,p) 


1.19 


1.05 


2.78 


3.57 


FIRE(L2.p + L2,,) 


2.06 


1.99 


4.2 


2.59 


FIRE(L2.g) 


5.16 


4.27 


6.11 


6.11 


K-Indicator 


TIKDE 


0.0415 


0.0415 


0.0415 


0.0415 


LSIF 


0.0435 


0.0435 


0.0531 


0.044 


FIRE(i2,p) 


0.00862 


0.00676 


0.0115 


0.0114 


FIRE(L2,p + L2.5) 


0.00559 


0.00575 


0.0191 


0.0108 


FIRE(L2,g) 


0.0117 


0.00935 


0.0217 


0.0217 


Coord. 


TIKDE 


0.0541 


0.0541 


0.0541 


0.0541 


LSIF 


0.0647 


0.0647 


0.139 


0.162 


FIRE(i2,p) 


0.0183 


0.0165 


0.032 


0.0334 


FIRE(L2,p + i2,,) 


0.0211 


0.0201 


0.0423 


0.0355 


FIRE(L2,g) 


0.0277 


0.0233 


0.0496 


0.0496 
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Table 2: CPUsmall data set with resampling using PCA(5, cr^), where a^ is the standard deviation of 
projected value on the first principal component. And \XP\ = 1000 and |X''| = 2000. Around 800 in X^ and 
1000 in X« are used in 5-folds CV. 





Linear 


Half- 


spaces 




N=50 


N=200 


N=50 


N=200 


Linear 


TIKDE 


0.102 


0.0965 


0.102 


0.0984 


LSIF 


0.115 


0.115 


0.115 


0.115 


FIRE(L2,p) 


0.0908 


0.0858 


0.0891 


0.0924 


FIRE(i2,p + L2,g) 


0.0832 


0.0825 


0.0825 


0.0718 


FIRE(L2,,) 


0.0889 


0.0907 


0.0932 


0.0899 


Half-spaces 


TIKDE 


0.00469 


0.00416 


0.00469 


0.00462 


LSIF 


0.00487 


0.00487 


0.00487 


0.00487 


FIRE(L2,p) 


0.00393 


0.00389 


0.00435 


0.00436 


FIRE(L2.p + L2,g) 


0.00385 


0.00383 


0.00383 


0.00345 


FIRE(i2,9) 


0.00421 


0.0044 


0.00459 


0.00427 


Kernel 


TIKDE 


9.82 


8.48 


9.82 


9.3 


LSIF 


9.6 


9.6 


9.6 


9.6 


FIRE(L2.p) 


6.96 


6.17 


8.02 


8.19 


FIRE(L2.p + L2,q) 


6.62 


6.62 


6.62 


6.35 


FIRE(L2,,) 


7.23 


7.17 


7.44 


7.38 


K-Indicator 


TIKDE 


0.00411 


0.00363 


0.00411 


0.00404 


LSIF 


0.00478 


0.00478 


0.00478 


0.00478 


FIRE(L2,p) 


0.0033 


0.00313 


0.0036 


0.00373 


FIRE(i2,p + L2,q) 


0.00306 


0.00306 


0.00306 


0.00288 


FIRE(L2,,) 


0.00358 


0.00354 


0.00365 


0.00366 


Coord. 


TIKDE 


0.00784 


0.0077 


0.00784 


0.00758 


LSIF 


0.00774 


0.00774 


0.00774 


0.00774 


FIRE(L2.p) 


0.00696 


0.00676 


0.00681 


0.00734 


FIRE(i2,p + L2,g) 


0.00647 


0.00637 


0.00637 


0.00584 


FIRE(i2,,) 


0.00693 


0.00692 


0.00699 


0.00689 
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6.4 Supervised Learning: Regression and Classification 

In our experiments, we compare our method FIRE with several methods under the setting of supervised 
learning, i.e. regression and classification. More specifically, we consider the situation part or all of the 
training set X^ are labeled and all of X'^ are unlabeled. In the following experiments, we will estimate 
the density ratio function using 1000 points in X^ and use the labeled data from X^ to build a regression 
function or classifier on X'^. 

6.4.1 Regression 

Given data sets {XP,YP) where X^ is for independent variable, and Y^ is for dependent variable, and a test 
data set X'^ with a different distribution, the regression problem is to obtain a function a predictor on X'^. 
To make the comparison between unweighted regression method and different weighting schemes, we use the 
simplest regression method, the least square linear regression. With this method, the regression function is 
of the form 

fix,/3) = p*x, 

where /3 — {XWX'^)^XWY and (•)+ denotes the pseudo-inverse of a matrix. Here VF is a diagonal matrix 
with the estimated density ratio on the diagonal. These are estimated using FIRE and other density ratio 
estimation methods for comparison. The results on 3 regression data sets are shown in Table [5] [3] and |4J 

Table 3: CPUsmall resampled using PCA(5,cri,), where (t„ is the standard deviation of projected value on 
the first principal component. \XP\ = 1000, |X«| = 2000. 



No. of Labeled 


100 


200 


500 


1000 


Weighting method 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


OLS 


0.740 


0.497 


0.828 


0.922 


TIKDE 


0.379 


0.359 


0.299 


0.291 


0.278 


0.279 


0.263 


0.267 


KMM 


1.857 


1.857 


1.899 


1.899 


2.508 


2.508 


2.739 


2.739 


LSIF 


0.390 


0.390 


0.309 


0.309 


0.329 


0.329 


0.314 


0.314 


FIRE(L2,p) 


0.327 


0.327 


0.286 


0.286 


0.272 


0.272 


0.260 


0.260 


FIRE(L2,p + L2,g) 


0.326 


0.330 


0.285 


0.287 


0.272 


0.272 


0.261 


0.259 


FIRE(L2,,) 


0.324 


0.333 


0.284 


0.288 


0.271 


0.272 


0.261 


0.260 



Table 4: Kin8nm resampled using 


PCA(10,(T^) 


\XP\^ 


1000, |X«| = 


2000. 




No. of Labeled 


100 


200 


500 


1000 


Weighting method 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


OLS 


0.588 


0.552 


0.539 


0.535 


TIKDE 


0.572 


0.574 


0.545 


0.545 


0.526 


0.529 


0.523 


0.524 


KMM 


0.582 


0.582 


0.547 


0.547 


0.522 


0.522 


0.514 


0.514 


LSIF 


0.565 


0.563 


0.543 


0.541 


0.520 


0.520 


0.517 


0.516 


FIRE(L2,p) 


0.567 


0.560 


0.548 


0.540 


0.524 


0.519 


0.522 


0.515 


FlRE{L2,p + L2,q) 


0.563 


0.560 


0.546 


0.540 


0.522 


0.519 


0.520 


0.515 


FlRE{L2,g) 


0.563 


0.560 


0.546 


0.541 


0.522 


0.519 


0.520 


0.515 



Table 5: Bank8FM resampled using PCA(l,cr,„). \X'p\ = 1000, \Xi\ = 2000. 



No. of Labeled 


100 


200 


500 


1000 


Weighting method 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


OLS 


0.116 


0.111 


0.105 


0.101 


TIKDE 


0.111 


0.111 


0.100 


0.100 


0.096 


0.096 


0.092 


0.092 


KMM 


0.112 


0.161 


0.103 


0.164 


0.099 


0.180 


0.095 


0.178 


LSIF 


0.113 


0.113 


0.109 


0.109 


0.104 


0.104 


0.099 


0.099 


FIRE(L2,p) 


0.110 


0.110 


0.101 


0.102 


0.097 


0.097 


0.093 


0.094 


FIRE(L2,p + L2,g) 


0.113 


0.110 


0.103 


0.102 


0.099 


0.097 


0.097 


0.094 


FlRE{L2,g) 


0.112 


0.118 


0.102 


0.106 


0.099 


0.103 


0.096 


0.102 



6.4.2 Classification 

Similarly to the case of regression the density ratio can also be used for building a classifier such as SVM. 
Given a set of labeled data, {{xi,yi),{x2,y2), • • • , ixn,yn)} and Xi ~ q, we building a linear classifier / by 
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the weighted hnear SVM algorithm as follows: 



C 



/ = arg min — 



i=l 



w,iP^x,-y,)+ + \\/3\\ 



The weights w^'s are obtained by various density ratios estimation algorithms using two data sets X^ and 
X"^. Note that estimating the density ratios using Xp and X' is completely independent of the label 
information. We also explore the performance of these weighted SVM as the number of labeled points used 
for training classifier changes. In the experiments, we first estimate the density ratios on the whole X^ with 
the parameters selected by cross validation. Then we subsample a portion of X^ and use their labels to train 
the classifier. And the performance of the classifier in terms of prediction error is estimated using all the 
points in X"^. The results on USPS hand- written digits and 20 news groups are shown in Table |6J [7J [8] and 

IS 



Table 6: USPS resampled using Feature information, PCA(5,crt,), where (Jy is the standard deviation of 
projected value on the first principal component. \XP\ = 1000 and |A"''| — 1371, with — 4 as —1 class and 
5 — 9 as +1 class. 



No. of Labeled 


100 


200 


500 


1000 


Weighting method 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


SVM 


0.102 


0.081 


0.057 


0.058 


TIKDE 


0.094 


0.094 


0.072 


0.072 


0.049 


0.049 


0.042 


0.042 


KMM 


0.081 


0.081 


0.059 


0.059 


0.047 


0.047 


0.044 


0.044 


LSIF 


0.095 


0.102 


0.073 


0.081 


0.050 


0.057 


0.044 


0.058 


FIRE(L2,p) 


0.089 


0.068 


0.053 


0.050 


0.041 


0.041 


0.037 


0.036 


FIRE(L2,p + L2,g) 


0.070 


0.070 


0.051 


0.051 


0.041 


0.041 


0.036 


0.036 


FIRE(L2,9) 


0.055 


0.073 


0.048 


0.054 


0.041 


0.044 


0.034 


0.039 



Table 7: USPS resampled based on Label information, V* only contains point with labels in L' = {0, 1, 5, 6}. 
The binary classes are with -1-1 class= {0,1,2,3,4}, -1 class= {5,6,7,8,9}. And \XP\ = 1000 and |X«| = 
2000. 



No. of Labeled 


100 


200 


500 


1000 


Weighting method 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


SVM 


0.186 


0.164 


0.129 


0.120 


TIKDE 


0.185 


0.185 


0.164 


0.164 


0.124 


0.124 


0.105 


0.105 


KMM 


0.175 


0.175 


0.135 


0.135 


0.103 


0.103 


0.085 


0.085 


LSIF 


0.185 


0.185 


0.162 


0.163 


0.122 


0.122 


0.108 


0.108 


FIRE(L2,p) 


0.179 


0.184 


0.161 


0.161 


0.115 


0.120 


0.107 


0.105 


FIRE(L2,p + L2,q) 


0.180 


0.185 


0.161 


0.162 


0.116 


0.120 


0.106 


0.107 


FIRE(L2,,) 


0.183 


0.184 


0.160 


0.162 


0.118 


0.120 


0.106 


0.103 
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Table 8: 20 News groups resampled using Feature information, PCA(5, a^), where a^ is the standard devi- 
ation of projected value on the first principal component. \XP\ — 1000 and jX"^! = 1536, with {2, 4, . . . , 20} 
as —1 class and {1,3, . . . , 19} as +1 class. 



No. of Labeled 


100 


200 


500 


1000 


Weighting method 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


SVM 


0.326 


0.286 


0.235 


0.204 


TIKDE 


0.326 


0.326 


0.286 


0.285 


0.235 


0.235 


0.204 


0.204 


KMM 


0.338 


0.338 


0.303 


0.303 


0.252 


0.252 


0.242 


0.242 


LSIF 


0.329 


0.325 


0.297 


0.285 


0.238 


0.235 


0.210 


0.204 


FIRE(L2,p) 


0.314 


0.324 


0.276 


0.278 


0.231 


0.234 


0.202 


0.210 


FIRE(L2,p + 1/2,9) 


0.315 


0.323 


0.276 


0.277 


0.232 


0.233 


0.200 


0.208 


FIRE(L2,9) 


0.317 


0.321 


0.277 


0.275 


0.232 


0.231 


0.197 


0.207 



Table 9: 20 News groups resampled based on Label information, X'^ only contains 
L' = {1, 2, . . . , 8}. The binary classes are with -f 1 class= {1, 2, 3, 4}, —1 class= {5, 6, . 
and IX"?! =4148. 



point with labels in 
..,20}. \XP\ = 1000 



No. of Labeled 


100 


200 


500 


1000 


Weighting method 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


Linear 


Half-spaces 


SVM 


0.354 


0.333 


0.300 


0.284 


TIKDE 


0.354 


0.353 


0.334 


0.335 


0.299 


0.298 


0.281 


0.285 


KMM 


0.368 


0.368 


0.341 


0.341 


0.295 


0.295 


0.270 


0.270 


LSIF 


0.353 


0.354 


0.336 


0.334 


0.304 


0.305 


0.286 


0.284 


FIRE(L2,p) 


0.347 


0.348 


0.334 


0.332 


0.303 


0.300 


0.282 


0.277 


FIRE(L2,p + 1/2,9) 


0.348 


0.348 


0.332 


0.332 


0.301 


0.301 


0.277 


0.277 


FIRE(L2,9) 


0.347 


0.349 


0.330 


0.330 


0.303 


0.300 


0.284 


0.278 



6.5 Simulated Examples 

6.5.1 Simulated Dataset 1. 

We use a simple example, where the two densities are known, to demonstrate the properties of our methods 
and how the number of data points influences the performance. 

For this experiment, we suppose p = 0.57V(-2, l2)-|-0.5A^(2, 0.5^) and q = N{0,0.5'^) and fix |X9| = 2000, 
and vary \Xp\ from 50 to 1000. We compare our method with the other two methods for the same problem: 
TIKDE and KMM. For all the methods we consider in this experiment, we will choose the optimal parameter 
based on the empirical L2 norm of the difference between the estimated ratio and the true ratio, which is 
supposed to be known in this toy example. Figure [2] gives the reader an intuition about how the estimated 
ratios behave for different methods. 






(a) TIKDE 



(b) FIRE 



(c) KMM 



Figure 2: Plots of estimation of the ratio of densities with jX^j = 500 of points from p — 0.5N{~2, 1^) + 
0.5iV(2,0.5^) and |X«| = 2000 points from q = A^(0,0.5^). The blues lines are true ratio, 2. Left column is 
the estimations from KDE with proper chosen threshold. Middle column is the estimations from our method, 
FIRE. And right one is the estimation from KMM. 

And Figure [3] shows how different methods perform when \XP\ varies from 50 to 1000 and \X'^\ is fixed 
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to be 2000. The boxplot is also a good way to illustrate the stability of the methods over 50 independent 
repetitions. 



80 



70 
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10 



[*' 



i i 
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n=50 



n=100 



n=300 



n=500 



n=1000 



Figure 3: Number of points from p, n varies from 50 to 1000 as the horizontal axis indicates, and the number 
of points from q is fixed to be 2000. For each n, the three bars, from left to right, belongs to TIKDE, 
FIRE(marked as red) and KMM. 



6.5.2 Simulated Dataset 2. 

In the second simulated example, we will test our method for various kernels and different norms as the cost 
function. More specifically, we suppose p = iV(0, 0.5^) and q = Unif([— 1, 1]). We will use this example to 
explore the power of our methods with different kernels. Three settings are considered in this experiments: 
(l)Different kernels kh for the RKHS. We use polynomial kernels of degree 1, 5 and 20, exponential kernel 
and Gaussian kernel; (2) Type-I setting and Type-II setting; (3) Different norm for the cost function in the 
algorithm, i.e. || • ||2,p and || • ||2,g. In this example, || • ||2,p focuses on the region close 0, but still has penalty 
outside interval [—1, 1]; || • ||2,p has uniform penalty on [—1, 1] and has no penalty at all outside the interval. 
In all settings, we fix the convolution kernel to be Gaussian, kt- When the RKHS kernel is exponential 
and Gaussian, we also need to decide their width. For simplicity, we just fix their width to be 20t, where 
t is the width of the convolution kernel kt- For setting Type-I, we will set \XP\ = 500 and \X''\ = 500; for 
Type-II setting, we only specify \Xp\ ~ 500. The results are shown in Figure l4J 
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Of Deg 1 
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of Deg 5 
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Exp. 


Kernel 




(a) p.d.f. of the two density wc considered. 



(b) t = 0.05, A = le - 3, various kernel for RKHS. 



True Ratip 
Norm 





(c) t = 0.03, A = le — 5, Gaussian, Type-I 



(d) t = 0.03, A = le - 5, Gaussian, Type-II 



Figure 4: Estimating the ratio between p =Unif([— 1, 1]) and q =N(0, 0.5). (a) shows the p{x) and q{x). The 
blue Unes in the rest subfigures is the true ratio -. In (b), different kernels for RKHS are used with t = 0.05 
and A = le — 3. In (c), we suppose two samples from p and q are available, thus Type-I setting is used. 
And we use Gaussian as RKHS kernel with fix the kernel width t — 0.03 and the regularization parameter 
A = le — 5 and the L2 norm for the cost function. In (d), we suppose it is Type-II setting, thus X^ is 
available and the function q is known. Besides this, (d) use the same parameters with (c). 
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A Proof for Lemma [5] 

Proof. RKHS is unique for a given domain and kernel, so is independent of the measure used to define the 

1/2 1/2 

L2.p. Thus for any g G i2,p, there should he h G L2 such that Lj 5 = Lj h and 
Since this is true for arbitrary g e -/j2.p, we have 
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/-L,Y;<? +A'||g 
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Because |1 • Ha.p < r|| • II2, 
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f-L^^h +X'\\h\\'i <r^ ( min 
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(37) 



To bound 
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/ieL2 



/ - i^y'h 



2 A' 
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r2n'^ll2 



We need the Fourier transform F : L2{M.'^) -^ L2{M.'^), defined as 
Lt on R'^ is the heat operator, thus L/ = L*. And 



Lt/(a;) = / kt{x,y)f{y)dy^{kt*f){x), 



R'i 



So, F{Ltf) — ktf. Note that F is an isometry. Thus it is the same to transform the (37) using Fourier 

transform. Then we have 

. . , 2 X' 

f — kih 
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2 r2 
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where kt{^) — e 2 . And let 



m 



'/(c) ifiicf<^^^^ 

Otherwise. 



-1 , 



and h= (kt) /. It is obvious that \\h\\j < ^\\f\\j = ^WfWl- And 



f — kth 



f-f 



Now we recall the definition of Sobolev space using Fourier transform, which states that /(^) 
for some u G L2. Thus, 



(i+|14IP)=/2 



HO 



f-f 

Thus, we have 



CP<-°:'^' (l + llelP)^/^"^^^'^V "4ll^<^^^^ ((l + IICP)^/^^^^7 '^^ - 4a« (log(i))^ 
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2 A' 



/l||?< 



4a^(log(i))^ 



I"ll2 + ^ll/ll2 



Let Di 



— and Z?9 = 1, we have the lemma. 
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B Proof for Lemma [6] 

For the compact manifold case, we also need to have similar lemma as the above one. However, the definition 
of Fourier transform is obscure, thus we need to consider alternative way to get the same bound. We can use 
the Laplace-Beltrami operator on the compact manifold. It has discrete spectrum and satisfies the Weyl's 
Law, see Chapter 8 in [5D], about the spectrum of the Laplace-Beltrami operator A, which is discrete if the 
manifold is compact. It states the following: the number of eigenvalues of Laplacian operator over a bounded 
domain with Neumann Bounday condition that are less or equal than a;, denoted by N(x), satisfies 

,. N{x) 

a;— >oo X ' 

for a constant C depending on the dimensionality and volume of the domain. This implies there exists M 
such that for any i > M, 

CX^I'' <m< C2l^'''- 

Also, we can redefine the Sobolev space on a compact manifold using Laplace-Beltrami operator. 



W^ = {/ e L2 



A^/2/ 



< CX)}. 



And this definition of W| is equivalent to common definition of Sobolev space using differentiation, see |37] 
for the details for this equivalence. 
First we need the following lemma. 



Lemma 9. Suppose f e W2 and Nt 



, then we have 



i>Nt 



where Vi is the eigenf unctions of Laplacian operator A and C is a constant independent oft. 
Proof. First let proof that 



E(/'^^ 



2 ■2s/d 



< 00. 



Using the implication of Weyl's Law, we have for i > M, i^*/'^ < 2i_. Thus, 



5: (/,.,)^-/'^ =$:(/, 
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Let E^o (/' «')2 *''/'' = C. For Nt = (i)". We have 
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Isjd 



E(/,«.)2< E(/'«')2*''^'<c. 



i>Nt 



i>Nt 



Thus, 



J2{f,v,)U^-ct'-'/'. 



i>Nt 



N, 



2s/d 



Now we can give the proof for Lemma [6J 
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Proof. RKHS is unique for a given domain and kernel, so is independent of the measure used to define the 

1/2 1/2 

L2,p. Thus for any g e 1^2,p, there should he h G L2 such that L^ ^ g — L^ h and 



Ilffll2,: 



- Ill 1/2 



1/2, 



Lt'Jg\\H.^\\K''h\\H, = \\h\\2 
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Since this is true for arbitrary g e ^2,p, we have 
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Expend / using the eigenfunctions v^^vi^ . . . of A, we have 

oc 
i=0 

Denote the eigenvalues of A as 770, ?7i, . . . , the heat operator defined as Ht — e^^* having eigenvalues as 
Q-vot^ g-i)it^ _ _ Recall the Weyl's law, we have there exists M such that for any i > M, cii'^^'^ < ?7i < 021^^'^ ■ 
When t is small enough, we will have Nt = j^ > M . Since we order rji in non-decreasing order, for any 

i < Nf, we have rji < rjNt ^ C2A'j — C2/t, also e"**'* > e^"^^. Now denote Pn be the operator that projects 
function f E L2 to the subspace spanned by first N eigenfunctions of A. Thus 

i<Nt 



where Vi is the eigenfunction of A. And let 
Thus, we have 
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Now 



let us proceed by bound the above formula. By Lemma 9 with N^ — j^ and s = 2, we have 



i=Nt + l 

Also, for i < Nt, e'''* < e=^ thus \\H-'^/^PnJ\\2 < e"^/^ ||-PwJ|l2 < e''^^^ 



1/2 
L/ ||i2->L2 ^ C'ifor a constant C". Thus, we have 



12- 



Recall we have llH 



1/2 



H^/2H-^/2p^j_ i^^H-^'^P^J < C'e^^/2 11/11, t 



For the third term in ( 38 1 , we have 



Nt Nt 



Hence, 



/ - 0^|[ + A'li^Mllp < r^ (ct^ + c'e^^/2 ii^ii^ ^y^ y^c2 ii^ii 



When t is small enough, t^ < t, letting Di = r^(C + C'e'^^^^),D2 = e'^^, we prove the lemma. 
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C Proof for Theorems in 14.2.2 



In the second case, since we do not have samples from q, we replace ICt q z by q. Consider corresponding 
Jx ' 



Thus, we need to bound the extra term (/C^p + AI)^^/C^p iq — JCt^p^ ). Let d = q — JCt^p^ and ||rf||2,p = ^t, 
we have 






E 



<7f{Ui,d)2,p 

crf + X 



< max 



cr>0 \ a 



St< 



(.: 



3 +2"t As 



< A^s^, 



The bound for 6t is given in the following lemma. 



Lemma 10. Suppose p,q are two den sity function of probability measures of the domain Q and satisfying 
the assumptions we gave in section 4-1 We have the following: (1) When O is M'^ and q G VF|(M^), we have 



K- 



■t,p 



P 



0{t) 



2,P 



(2) When Omega is a manifold M. without boundary of d dimension and q G W2{M.), we have 

q 



^t,p- — q 
P 



o{t'-n 



2,P 



for any < e < 1. 

Proof. By definition of ICt^p, we have 

}Ct.p--q= ktix,y)——p{y)dy-q{x)^ kt{x,y){q{y) - q{x))dy = {K.t -T)q 
P Jw P[y) Jri 

By results in [?], we have {ICt —I)q = tAq + o{t) when q is twice differentiable. Due to g S W|(]R'^), we have 
||Ag||2 < oo. Thus, we have 

\\ICt,p- - qh,p < niCt,p- - qh = r||iAg + o(t)|J2 < n\\Aqh + o{t) = 0{t). 

For manifold case, we have {ICt — 'D)q = tAq + o{t), where Vf = Jj^ kt{x, y)dyf{x). Thus, 

{ICt~I)q^{ICt-V)q+{V-I)q. 
For the first term, we have the same rate with M"^. Now we proceed by bounding the second term. 



\{T^-^)qh 



kt{-,y)dy - 1 q{-) 



M 



< 



kt{-,y)dy- 1 



M 



We know that \\q\\2 < oo. 

Let Bt{x) = {y & M. : \\x — y\\2 < t^~^} and Rt{x) is the projection of Bt{x) on the T^^M.. In the 
following proof, we need to use change of variables to converting integral over a manifold to the integral over 
the tangent space at a specific point. For two points x,y (£ Ai, let y' = '!^x{y) be the projection of y in the 
tangent space T^ of M at x. Let Jtt^I,, denote the Jacobian of the map -Kx at point y E Ai and J^-i 
the inverse. For y sufficiently close to x, we have 

\\x-y\\^\\x~y'\\+0{\\x-y'f) 
^^J.-l =0{\\x-y\\') 



J.-i 



- 1 



= 0{\\x-y'r). 
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Thus, it is true that the points in Rt{x) are still no further than 2t^^'^, when t is small enough. Since kt 
has exponential decay, the integral /g i^) ktiy, ■)dy is of order 0(e^* ), and so is /r (j.\ kt{y' , ■)dy' . Thus, 
for any point a; G A^, 



/ct(x, ■)dx — 1 



M 



kt{y,-)dy-l + 0{e-''') 



Bt(x) 



kt{y,-)dy- kt{y\-)dy' + 0{e~'' ) 

Bt{x) JT^M 



h{v'r)J^-Ay'dy' - / kt{xr)dx + 0{e * ) 

B.(x) JFI.(x) 



kt(x,-)(J^-i\:,-l)dx + 0{e~* ') 

Rix) 

0{t^-^') f ktix,-)dx + 0{e-'~') 

Jr{x) 



=0{t'-''') / kt{x, ■)dx + 0(e-*"') + 0(e-*"') 
Abusing the notation of e, we have || jj^ kt{-,y)dy — 1||2 < 0{t^~^) where < £ < 1. 

'A ^ JA.zlb.pi 



For the concentration of ||/^^ — fj^z\\2,'pi ^'^ '^^iU consider their close formulas 



/L- ^L + AX IClq 



n 



(39) 



By the similar argument to that in Lemma [8J we will have the following lemma gives the concentration 
bound. 

Lemma 11. Let p be a density of a probability measure over a domain X and q another density function. 
They satisfy the assumptions in 4-1 Consider f^ and f^^^ defined in Eq. 39 with confidence at least l—2e~'^, 
we have 



I fii fii 
\j\ ~ J\,z 






<c. 



where Kt = supj.gj2 kt{x, x) 



{2TTtY/^ 



Proof Let / = ( /C^ ^ + Xl] K^q. We have /" - /", - /" - / + / - /i^ . For /" - /, using the fact that 



ff - f 



[JCl + XL) ff = JClq, we have 






And 



/ /a,z 

= {k.1^ +xiy\iq- {ki^ +xiy\ic 



ICl + XI) (jCl -K,l\q 
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Notice that we have /C^ ~ ^p ^'^^d /C^ — /Cp in the identity we get. For these two objects, it is not hard 
to verify the following identities, 

= {iCzp - ICp) + K.p (iCzp - ICp) + {K.zp - ICp) JCp {JCzj, - ICp) + (iCzj, - ICp) JCp 



And 



K.p [JCzp - K,p) + JCp [K.zp - ICp) ICp + [iCzp - ICp) ICp. 



I^'ip -1^1 = {l^zp - ICp) + ICp (iCzp - ICp) + (/Czp - ICp) ICp 

Thus, in these two identities, the only two random variables are ICz ~ICp. By results about concentration 
of ICz and ICz , we have with probability 1 — 2e~'^ , 



< 



'It 



II J' I- 1 1 n ^ ,,j 

And we know that for a large enough constant c which is independent of t and A, 



I — 1 

/^tlklloc 



(40) 



l^plk^w < c, 



(^^,+A^)" 



u^n 



< y\\K^p<l\\n < chhp 



and 



!/;■!& - T. ^^. i^'-f i (™p o?Ta? ) £ <'• "'>' S T Mi, 



thus, ii/;"ii« < ji, II, 



Notice that 



2,p- 



{ICzp -ICp)' 



< \\ICz — ic 



■H 



pW-h 



and 



{iCzp -ICpY 



< \\ICz — IC 



■H 



pW-h 



both of this could 



be of smaller order compared with \\ICzp — ICpUn,- For simplicity we hide the term including them in the 
final bound without changing the dominant order. We could also hide the terms with the product of any 
two the random variables in Eq. |40[ which is of prior order compared to the term with only one random 
variable. Now let us put everything together, 

II fll fll II <- ^1/2 II fll fll I 

II/a - J\zh,p<C ' \\.tx - J\,z\\Ht 

/2 fc^Kt\^ ||„|| , c'^Kt\^ 



<cV2('i^lMI., + ^^ll<z 



<C4 



VA3/2^ 



X\/n 



.A3/2^/^ A^^: 

where C4 = C'l'^ max [c \\q\^p , ||<7||oo j • 

Given the above lemmas, the main theorem for the second case follows. 
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